Efficient location and identification of documents in images

ABSTRACT

Efficient location and identification of documents in images. In an embodiment, at least one quadrangle is extracted from an image based on line(s) extracted from the image. Parameter(s) are determined from the quadrangle(s), and keypoints are extracted from the image based on the parameter(s). Input descriptors are calculated for the keypoints and used to match the keypoints to reference keypoints, to identify classification candidate(s) that represent a template image of a type of document. The type of document and distortion parameter(s) are determined based on the classification candidate(s).

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation of U.S. patent application Ser. No. 17/237,596, filed on Apr. 22, 2021, which claims priority to Russian Patent App. No. 2020129039, filed on Sep. 2, 2020, which are both hereby incorporated herein by reference as if set forth in full.

BACKGROUND

Field of the Invention

The embodiments described herein are generally directed to image recognition, and, more particularly, to an efficient means for locating and identifying the type of document in an image.

Description of the Related Art

In the modern world, various organizations require the constant transmission and verification of personal data. For example, government agencies may require personal data of an individual in order to check and receive payment for taxes and fines owed by the individual, process applications by the individual for government services, verify the identity of the individual as he or she passes through a security point, and/or the like. Similarly, private companies may require personal data of an individual in order to reserve transportation tickets (e.g., on an airline or railway) and hotel accommodations, process applications by the individual for private services (e.g., insurance, loans, etc.), and/or the like.

Automated document-recognition systems can be used to expedite document processing and reduce the risk of document fraud (e.g., identity fraud, forgery of documents, etc.). However, it would be especially advantageous if such systems could be adapted for use on mobile devices. Mobile devices are a less expensive, faster, and more convenient alternative to the bulky specialized hardware scanners that are typically required for document-based acquisition of personal data (e.g., from identity documents).

In addition, the incorporation of automated document recognition within mobile devices would enable personal-data acquisition to be integrated into larger services intended for consumption by end users. However, in many applications, the functionality of the automated document recognition would have to be provided even in conditions of limited connectivity (e.g., when the mobile device does not have a network connection). Furthermore, in some jurisdictions, regulatory restrictions may prohibit or restrict the storage and transmission of personal data. For example, the law of the Russian Federation requires personal data of citizens to be stored and processed on systems located in the Russian territory, and restricts the simultaneous transmission of personal and biometric data. Thus, it would be advantageous for the automated document recognition to be performed directly on the mobile device, such that no network connection or transmission is required.

SUMMARY

Accordingly, systems, methods, and non-transitory computer-readable media are disclosed for efficient recognition of the position and classification of documents in images. The disclosed techniques may be particularly suited for situations in which there are limited resources, such as a mobile device with limited processing and/or memory resources, no network connection, a weak, slow, or low-bandwidth connection, an otherwise limited network connection, and/or the like.

In an embodiment, a method is disclosed that comprises using at least one hardware processor to: receive an input image; extract one or more lines from the input image; extract at least one quadrangle from the input image based on the one or more lines; determine one or more parameters based on the at least one quadrangle; extract a plurality of input keypoints from the input image based on the one or more parameters; calculate an input descriptor for each of the plurality of input keypoints; match the plurality of input keypoints to a plurality of reference keypoints in a reference database, based on the input descriptor calculated for each of the plurality of input keypoints, to identify one or more classification candidates, wherein each of the one or more classification candidates represents a template image of a type of document; determine the type of document in the input image and one or more distortion parameters for the document based on the one or more classification candidates; and output the determined type of document and one or more distortion parameters. The one or more distortion parameters may comprise a homography matrix.
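By way of non-limiting illustration, the following sketch shows the core matching and distortion-estimation steps of such a method in Python using OpenCV. ORB keypoints and a brute-force Hamming matcher stand in for the particular detectors and index structures described below, and the threshold values are illustrative assumptions only:

```python
import cv2
import numpy as np

def classify_against_template(input_img, template_img, min_inliers=15):
    # Detect keypoints and binary descriptors. ORB stands in for the
    # detectors named elsewhere in this disclosure (SURF, YACIPE), which
    # are not bundled with a default OpenCV build.
    orb = cv2.ORB_create(nfeatures=1000)
    kp_q, des_q = orb.detectAndCompute(input_img, None)
    kp_t, des_t = orb.detectAndCompute(template_img, None)
    if des_q is None or des_t is None:
        return None
    # Match input descriptors to reference descriptors (Hamming distance
    # is appropriate for binary feature vectors).
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_q, des_t)
    if len(matches) < min_inliers:
        return None
    src = np.float32([kp_q[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_t[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    # Estimate the distortion parameters as a homography matrix H: Q -> T.
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    if H is None or int(mask.sum()) < min_inliers:
        return None
    return H
```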

The reference database may comprise a plurality of sets of reference keypoints and descriptors, wherein each of the plurality of sets represents one of a plurality of template images, and wherein each of the plurality of template images represents one of a plurality of types of document. At least one of the plurality of template images may be represented by at least four sets of reference keypoints and descriptors in the reference database, wherein each of the four sets represents one of the plurality of types of document rotated by a different amount of rotation than all others of the four sets. The different amounts of rotation for the four sets may comprise 0°, 90°, 180°, and 270°.

The method may further comprise, for each of the plurality of types of document: receiving the template image representing that type of document; extracting a plurality of reference keypoints from the template image; calculating a reference descriptor for each of the plurality of reference keypoints; and storing a compact representation of the template image in the reference database, wherein the compact representation comprises the plurality of reference keypoints and the reference descriptors calculated for the plurality of reference keypoints. Each reference descriptor may be stored in a hierarchical clustering tree. Extracting a plurality of reference keypoints from the template image may comprise excluding any keypoints that are within a region of the template image that has been identified as representing a field of variable data. Extracting a plurality of reference keypoints from the template image may comprise selecting the plurality of reference keypoints by: calculating a score for a plurality of candidate keypoints using a Yet Another Contrast-Invariant Point Extractor (YACIPE) algorithm; and selecting a subset of the plurality of candidate keypoints with highest scores as the plurality of reference keypoints. Calculating a reference descriptor for each of the plurality of reference keypoints may comprise calculating a receptive field descriptor for an image region around the reference keypoint. Each reference descriptor may comprise a vector of binary features.
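A non-limiting sketch of building such a compact representation follows. One set of keypoints and descriptors is computed per 90° rotation of the template image, keypoints within regions of variable data are excluded, and only the highest-scoring keypoints are retained. ORB and its response score stand in for the YACIPE detector and score, and variable_fields is assumed to hold (x, y, w, h) rectangles already expressed in each rotated image's coordinates:

```python
import cv2

def build_reference_entries(template_img, variable_fields, n_keypoints=500):
    # One compact representation per 90-degree rotation of the template,
    # as described above. `variable_fields` is assumed to hold
    # (x, y, w, h) rectangles of variable data, already expressed in the
    # coordinates of each rotated image.
    detector = cv2.ORB_create(nfeatures=n_keypoints * 2)
    entries = []
    for rotation in (None, cv2.ROTATE_90_CLOCKWISE, cv2.ROTATE_180,
                     cv2.ROTATE_90_COUNTERCLOCKWISE):
        img = template_img if rotation is None else cv2.rotate(template_img, rotation)
        kps = detector.detect(img, None)
        # Exclude keypoints that fall within a field of variable data.
        kps = [k for k in kps if not any(
            x <= k.pt[0] <= x + w and y <= k.pt[1] <= y + h
            for (x, y, w, h) in variable_fields)]
        # Keep the highest-scoring candidates (the detector's response
        # score stands in for the YACIPE score).
        kps = sorted(kps, key=lambda k: k.response, reverse=True)[:n_keypoints]
        kps, des = detector.compute(img, kps)
        entries.append((kps, des))
    return entries
```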

The method may further comprise extracting data from the input image based on the determined type of document and the one or more distortion parameters. The extracted data may comprise one or more of text, an image, or a table.

The one or more classification candidates may comprise a plurality of classification candidates, and the method may further comprise using the at least one hardware processor to: calculate a rank for each of the plurality of classification candidates; and select one of the plurality of classification candidates based on the calculated ranks, wherein determining the type of document in the input image comprises identifying a type of document associated with the selected classification candidate. Selecting one of the plurality of classification candidates may comprise: selecting a subset of the plurality of classification candidates that have highest calculated ranks; performing a geometric validation with at least one of the classification candidates in the selected subset to identify the classification candidate as valid or invalid; and selecting one of the classification candidates, having a maximum calculated rank, from the classification candidates in the selected subset that are identified as valid. The geometric validation with each of the one or more classification candidates may comprise: calculating a transformation matrix that maps input keypoints in the input image to reference keypoints in the template image represented by the classification candidate; when the mapping is within a predefined accuracy, determining that the transformation matrix is valid; and, when the mapping is not within the predefined accuracy, determining that the transformation matrix is invalid. The transformation matrix may be a Random Sample Consensus (RANSAC) transformation matrix. For each of the one or more classification candidates, the transformation matrix may transform vertices of the at least one quadrangle to corners of the template image represented by the classification candidate. For each of the one or more classification candidates, the transformation matrix may be constrained by one or both of the following: a distance between any two reference keypoints is greater than a minimum distance threshold; or the at least one quadrangle is convex and no vertices of the at least one quadrangle lie outside the input image by more than a maximum distance threshold.
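The convexity and out-of-image constraints on the quadrangle may be checked, for example, as in the following sketch (the 50-pixel maximum distance threshold is an illustrative assumption):

```python
import cv2
import numpy as np

def quadrangle_is_plausible(quad, image_shape, max_outside=50.0):
    # Constraint sketch: the detected quadrangle must be convex, and no
    # vertex may lie outside the input image by more than a maximum
    # distance threshold (the 50-pixel value is an assumption).
    h, w = image_shape[:2]
    pts = np.float32(quad).reshape(-1, 1, 2)
    if not cv2.isContourConvex(pts):
        return False
    for x, y in pts.reshape(-1, 2):
        if (x < -max_outside or y < -max_outside
                or x > w + max_outside or y > h + max_outside):
            return False
    return True
```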

Extracting one or more lines from the input image may comprise applying a Hough transform to transform at least a portion of the input image into a Hough parameter space. Extracting one or more lines from the input image may comprise, for each of a plurality of regions of interest in the input image, calculating a grayscale boundaries map, and, iteratively until a predefined number of candidate lines are identified, applying a Fast Hough Transform to the boundaries map to produce a Hough parameter space, identifying a candidate line with a highest value in the Hough parameter space, and, if a number of identified candidate lines is less than the predefined number of candidate lines, erasing boundaries in a neighborhood of the identified candidate line with the highest value in the Hough parameter space. Extracting at least one quadrangle from the input image may comprise generating a plurality of candidate quadrangles using pairwise intersection of the identified candidate lines across at least two of the plurality of regions of interest, for each of the plurality of candidate quadrangles, calculating a weight for the candidate quadrangle based on weights associated with constituent lines of the candidate quadrangle, and selecting the at least one quadrangle from the plurality of candidate quadrangles based on the calculated weights for the plurality of candidate quadrangles. The one or more lines may comprise a plurality of lines, wherein extracting at least one quadrangle from the input image comprises: classifying each of the plurality of lines as either mostly horizontal or mostly vertical; generating an intersections graph based on the classifications of the plurality of lines, wherein each vertex in the intersections graph represents one of the plurality of lines, and wherein each edge in the intersections graph represents an intersection point of two of the plurality of lines; tagging each edge in the intersections graph with a corner type that corresponds to the represented intersection point; identifying one or more cycles in the intersections graph, wherein each cycle comprises four edges that are all tagged with different corner types; and selecting one of the one or more cycles as the at least one quadrangle based on a weighting.
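The iterative peak-extraction scheme described above may be sketched as follows. The standard Hough transform from scikit-image stands in for the Fast Hough Transform and, for brevity, the neighborhood of each selected peak is erased directly in the parameter space rather than erasing boundaries in the boundaries map and re-applying the transform; the line count and erasure radius are illustrative assumptions:

```python
import numpy as np
from skimage.transform import hough_line

def top_candidate_lines(edge_map, n_lines=10, erase_radius=5):
    # Iteratively pick the strongest lines from a Hough accumulator,
    # erasing a neighborhood around each peak so that near-duplicate
    # lines are suppressed.
    hspace, angles, dists = hough_line(edge_map)
    acc = hspace.astype(np.float64).copy()
    lines = []
    for _ in range(n_lines):
        d_idx, a_idx = np.unravel_index(np.argmax(acc), acc.shape)
        if acc[d_idx, a_idx] <= 0:
            break
        lines.append((angles[a_idx], dists[d_idx], acc[d_idx, a_idx]))
        # Erase the neighborhood of the chosen peak in the parameter space.
        d0, d1 = max(0, d_idx - erase_radius), d_idx + erase_radius + 1
        a0, a1 = max(0, a_idx - erase_radius), a_idx + erase_radius + 1
        acc[d0:d1, a0:a1] = 0
    return lines
```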

The one or more parameters may comprise one or both of a scale and a rotation angle of the at least one quadrangle. Extracting the at least one quadrangle may comprise determining a weight for each of one or more candidate quadrangles extracted from the input image based on the one or more lines, and identifying one of the one or more candidate quadrangles with a highest weight as the at least one quadrangle. Determining one or more parameters based on the at least one quadrangle may comprise: when the weight for the at least one quadrangle is greater than a predefined threshold, using a scale and rotation angle of the at least one quadrangle as the scale and the rotation angle in the one or more parameters; and, when the weight is less than the predefined threshold, using a default value for the scale in the one or more parameters, and determining the rotation angle in the one or more parameters by sorting constituent lines of the one or more candidate quadrangles that comply with geometric restrictions by an angle between each constituent line and a reference line, identifying a single constituent line for which an angular window includes a maximum number of constituent lines, and using the angle between the identified constituent line and the reference line as the rotation angle in the one or more parameters.
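A minimal sketch of this fallback rotation estimate follows: the constituent lines' angles relative to the reference line are sorted, and the angle whose surrounding angular window captures the most lines is returned (the window width is an illustrative assumption):

```python
import numpy as np

def estimate_rotation(line_angles, window_deg=2.0):
    # Among the candidate lines' angles (in degrees, relative to the
    # reference line), find the angle whose surrounding window contains
    # the most other lines, and use it as the rotation angle.
    angles = np.sort(np.asarray(line_angles, dtype=np.float64))
    best_angle, best_count = 0.0, -1
    for a in angles:
        count = np.count_nonzero(np.abs(angles - a) <= window_deg / 2)
        if count > best_count:
            best_angle, best_count = a, count
    return best_angle
```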

Any of the methods may be embodied in executable software modules of a processor-based system, such as a server, and/or in executable instructions stored in a non-transitory computer-readable medium.

BRIEF DESCRIPTION OF THE DRAWINGS

The details of the present invention, both as to its structure and operation, may be gleaned in part by study of the accompanying drawings, in which like reference numerals refer to like parts, and in which:

FIG. 1 illustrates an example infrastructure in which one or more of the processes described herein may be implemented, according to an embodiment;

FIG. 2 illustrates an example processing system, by which one or more of the processes described herein may be executed, according to an embodiment;

FIG. 3 illustrates a process in which document recognition is performed, according to an embodiment;

FIG. 4 illustrates a document-recognition process, according to an embodiment; and

FIGS. 5A and 5B illustrate examples of template and input images in the context of document recognition, according to an embodiment.

DETAILED DESCRIPTION

In an embodiment, systems, methods, and non-transitory computer-readable media are disclosed for efficient recognition of documents in images. After reading this description, it will become apparent to one skilled in the art how to implement the invention in various alternative embodiments and alternative applications. However, although various embodiments of the present invention will be described herein, it is understood that these embodiments are presented by way of example and illustration only, and not limitation. As such, this detailed description of various embodiments should not be construed to limit the scope or breadth of the present invention as set forth in the appended claims.

1. System Overview

1.1. Infrastructure

FIG. 1 illustrates an example infrastructure in which one or more of the processes described herein may be implemented, according to an embodiment. The infrastructure may comprise a platform 110 (e.g., one or more servers) which hosts and/or executes one or more of the various functions, processes, methods, and/or software modules described herein. Platform 110 may comprise dedicated servers, or may instead comprise cloud instances, which utilize shared resources of one or more servers. These servers or cloud instances may be collocated and/or geographically distributed. Platform 110 may also comprise or be communicatively connected to a server application 112 and/or one or more databases 114. In addition, platform 110 may be communicatively connected to one or more user systems 130 via one or more networks 120. Platform 110 may also be communicatively connected to one or more external systems 140 (e.g., other platforms, websites, etc.) via one or more networks 120.

Network(s) 120 may comprise the Internet, and platform 110 may communicate with user system(s) 130 through the Internet using standard transmission protocols, such as HyperText Transfer Protocol (HTTP), HTTP Secure (HTTPS), File Transfer Protocol (FTP), FTP Secure (FTPS), Secure Shell FTP (SFTP), and the like, as well as proprietary protocols. While platform 110 is illustrated as being connected to various systems through a single set of network(s) 120, it should be understood that platform 110 may be connected to the various systems via different sets of one or more networks. For example, platform 110 may be connected to a subset of user systems 130 and/or external systems 140 via the Internet, but may be connected to one or more other user systems 130 and/or external systems 140 via an intranet. Furthermore, while only a few user systems 130 and external systems 140, one server application 112, and one set of database(s) 114 are illustrated, it should be understood that the infrastructure may comprise any number of user systems, external systems, server applications, and databases.

User system(s) 130 may comprise any type or types of computing devices capable of wired and/or wireless communication, including without limitation, desktop computers, laptop computers, tablet computers, smartphones or other mobile devices, servers, game consoles, televisions, set-top boxes, electronic kiosks, point-of-sale terminals, Automated Teller Machines, and/or the like. In a primary client-side embodiment, described herein, it is contemplated that user system 130 would normally comprise a smartphone, tablet computer, or other mobile device, whereas in a primary server-side embodiment, described herein, it is contemplated that user system 130 would normally comprise a smartphone, an image scanner, a desktop or laptop computer connected to an image scanner, or the like.

Platform 110 may comprise web servers which host one or more websites and/or web services. In embodiments in which a website is provided, the website may comprise a graphical user interface, including, for example, one or more screens (e.g., webpages) generated in HyperText Markup Language (HTML) or other language. Platform 110 transmits or serves one or more screens of the graphical user interface in response to requests from user system(s) 130. In some embodiments, these screens may be served in the form of a wizard, in which case two or more screens may be served in a sequential manner, and one or more of the sequential screens may depend on an interaction of the user or user system 130 with one or more preceding screens. The requests to platform 110 and the responses from platform 110, including the screens of the graphical user interface, may both be communicated through network(s) 120, which may include the Internet, using standard communication protocols (e.g., HTTP, HTTPS, etc.). These screens (e.g., webpages) may comprise a combination of content and elements, such as text, images, videos, animations, references (e.g., hyperlinks), frames, inputs (e.g., textboxes, text areas, checkboxes, radio buttons, drop-down menus, buttons, forms, etc.), scripts (e.g., JavaScript), and the like, including elements comprising or derived from data stored in one or more databases (e.g., database(s) 114) that are locally and/or remotely accessible to platform 110. Platform 110 may also respond to other requests from user system(s) 130.

Platform 110 may further comprise, be communicatively coupled with, or otherwise have access to one or more database(s) 114. For example, platform 110 may comprise one or more database servers which manage one or more databases 114. A user system 130 or server application 112 executing on platform 110 may submit data (e.g., user data, form data, etc.) to be stored in database(s) 114, and/or request access to data stored in database(s) 114. Any suitable database may be utilized, including without limitation MySQL™, Oracle™, IBM™, Microsoft SQL™, Access™, PostgreSQL™, and the like, including cloud-based databases and proprietary databases. Data may be sent to platform 110, for instance, using the well-known POST request supported by HTTP, via FTP, and/or the like. This data, as well as other requests, may be handled, for example, by server-side web technology, such as a servlet or other software module (e.g., comprised in server application 112), executed by platform 110.

In embodiments in which a web service is provided, platform 110 may receive requests from external system(s) 140, and provide responses in eXtensible Markup Language (XML), JavaScript Object Notation (JSON), and/or any other suitable or desired format. In such embodiments, platform 110 may provide an application programming interface (API) which defines the manner in which user system(s) 130 and/or external system(s) 140 may interact with the web service. Thus, user system(s) 130 and/or external system(s) 140 (which may themselves be servers), can define their own user interfaces, and rely on the web service to implement or otherwise provide the backend processes, methods, functionality, storage, and/or the like, described herein. For example, in such an embodiment, client application 132 executing on one or more user system(s) 130 may interact with server application 112 executing on platform 110 to execute one or more or a portion of one or more of the various functions, processes, methods, and/or software modules described herein. Client application 132 may be “thin,” in which case processing is primarily carried out server-side by server application 112 on platform 110. A basic example of a thin client application is a browser application, which simply requests, receives, and renders webpages at user system(s) 130, while the server application on platform 110 is responsible for generating the webpages and managing database functions. Alternatively, the client application may be “thick,” in which case processing is primarily carried out client-side by user system(s) 130. It should be understood that client application 132 may perform an amount of processing, relative to server application 112 on platform 110, at any point along this spectrum between “thin” and “thick,” depending on the design goals of the particular implementation. In any case, the application described herein, which may wholly reside on either platform 110 (e.g., in which case server application 112 performs all processing) or user system(s) 130 (e.g., in which case client application 132 performs all processing) or be distributed between platform 110 and user system(s) 130 (e.g., in which case server application 112 and client application 132 both perform processing), can comprise one or more executable software modules that implement one or more of the functions, processes, or methods of the application described herein.

1.2. Example Processing Device

FIG. 2 is a block diagram illustrating an example wired or wireless system 200 that may be used in connection with various embodiments described herein. For example, system 200 may be used as or in conjunction with one or more of the functions, processes, or methods (e.g., to store and/or execute the application or one or more software modules of the application) described herein, and may represent components of platform 110, user system(s) 130, external system(s) 140, and/or other processing devices described herein. System 200 can be a server or any conventional personal computer, or any other processor-enabled device that is capable of wired or wireless data communication. Other computer systems and/or architectures may also be used, as will be clear to those skilled in the art.

System 200 preferably includes one or more processors, such as processor 210. Additional processors may be provided, such as an auxiliary processor to manage input/output, an auxiliary processor to perform floating-point mathematical operations, a special-purpose microprocessor having an architecture suitable for fast execution of signal-processing algorithms (e.g., digital-signal processor), a slave processor subordinate to the main processing system (e.g., back-end processor), an additional microprocessor or controller for dual or multiple processor systems, and/or a coprocessor. Such auxiliary processors may be discrete processors or may be integrated with processor 210. Examples of processors which may be used with system 200 include, without limitation, the Pentium® processor, Core i7® processor, and Xeon® processor, all of which are available from Intel Corporation of Santa Clara, Calif.

Processor 210 is preferably connected to a communication bus 205. Communication bus 205 may include a data channel for facilitating information transfer between storage and other peripheral components of system 200. Furthermore, communication bus 205 may provide a set of signals used for communication with processor 210, including a data bus, address bus, and/or control bus (not shown). Communication bus 205 may comprise any standard or non-standard bus architecture such as, for example, bus architectures compliant with industry standard architecture (ISA), extended industry standard architecture (EISA), Micro Channel Architecture (MCA), peripheral component interconnect (PCI) local bus, standards promulgated by the Institute of Electrical and Electronics Engineers (IEEE) including IEEE 488 general-purpose interface bus (GPIB), IEEE 696/S-100, and/or the like.

System 200 preferably includes a main memory 215 and may also include a secondary memory 220. Main memory 215 provides storage of instructions and data for programs executing on processor 210, such as one or more of the functions and/or modules discussed herein. It should be understood that programs stored in the memory and executed by processor 210 may be written and/or compiled according to any suitable language, including without limitation C/C++, Java, JavaScript, Perl, Visual Basic, .NET, and the like. Main memory 215 is typically semiconductor-based memory such as dynamic random access memory (DRAM) and/or static random access memory (SRAM). Other semiconductor-based memory types include, for example, synchronous dynamic random access memory (SDRAM), Rambus dynamic random access memory (RDRAM), ferroelectric random access memory (FRAM), and the like, including read only memory (ROM).

Secondary memory 220 may optionally include an internal medium 225 and/or a removable medium 230. Removable medium 230 is read from and/or written to in any well-known manner. Removable storage medium 230 may be, for example, a magnetic tape drive, a compact disc (CD) drive, a digital versatile disc (DVD) drive, other optical drive, a flash memory drive, and/or the like.

Secondary memory 220 is a non-transitory computer-readable medium having computer-executable code (e.g., disclosed software modules) and/or other data stored thereon. The computer software or data stored on secondary memory 220 is read into main memory 215 for execution by processor 210.

In alternative embodiments, secondary memory 220 may include other similar means for allowing computer programs or other data or instructions to be loaded into system 200. Such means may include, for example, a communication interface 240, which allows software and data to be transferred from external storage medium 245 to system 200. Examples of external storage medium 245 may include an external hard disk drive, an external optical drive, an external magneto-optical drive, and/or the like. Other examples of secondary memory 220 may include semiconductor-based memory, such as programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable read-only memory (EEPROM), and flash memory (block-oriented memory similar to EEPROM).

As mentioned above, system 200 may include a communication interface 240. Communication interface 240 allows software and data to be transferred between system 200 and external devices (e.g. printers), networks, or other information sources. For example, computer software or executable code may be transferred to system 200 from a network server (e.g., platform 110) via communication interface 240. Examples of communication interface 240 include a built-in network adapter, network interface card (NIC), Personal Computer Memory Card International Association (PCMCIA) network card, card bus network adapter, wireless network adapter, Universal Serial Bus (USB) network adapter, modem, a wireless data card, a communications port, an infrared interface, an IEEE 1394 fire-wire, and any other device capable of interfacing system 200 with a network (e.g., network(s) 120) or another computing device. Communication interface 240 preferably implements industry-promulgated protocol standards, such as Ethernet IEEE 802 standards, Fiber Channel, digital subscriber line (DSL), asynchronous digital subscriber line (ADSL), frame relay, asynchronous transfer mode (ATM), integrated digital services network (ISDN), personal communications services (PCS), transmission control protocol/Internet protocol (TCP/IP), serial line Internet protocol/point to point protocol (SLIP/PPP), and so on, but may also implement customized or non-standard interface protocols as well.

Software and data transferred via communication interface 240 are generally in the form of electrical communication signals 255. These signals 255 may be provided to communication interface 240 via a communication channel 250. In an embodiment, communication channel 250 may be a wired or wireless network (e.g., network(s) 120), or any variety of other communication links. Communication channel 250 carries signals 255 and can be implemented using a variety of wired or wireless communication means including wire or cable, fiber optics, conventional phone line, cellular phone link, wireless data communication link, radio frequency (“RF”) link, or infrared link, just to name a few.

Computer-executable code (e.g., computer programs, such as the disclosed application, or software modules) is stored in main memory 215 and/or secondary memory 220. Computer programs can also be received via communication interface 240 and stored in main memory 215 and/or secondary memory 220. Such computer programs, when executed, enable system 200 to perform the various functions of the disclosed embodiments as described elsewhere herein.

In this description, the term “computer-readable medium” is used to refer to any non-transitory computer-readable storage media used to provide computer-executable code and/or other data to or within system 200. Examples of such media include main memory 215, secondary memory 220 (including internal medium 225, removable medium 230, and external storage medium 245), and any peripheral device communicatively coupled with communication interface 240 (including a network information server or other network device). These non-transitory computer-readable media are means for providing executable code, programming instructions, software, and/or other data to system 200.

In an embodiment that is implemented using software, the software may be stored on a computer-readable medium and loaded into system 200 by way of removable medium 230, I/O interface 235, or communication interface 240. In such an embodiment, the software is loaded into system 200 in the form of electrical communication signals 255. The software, when executed by processor 210, preferably causes processor 210 to perform one or more of the processes and functions described elsewhere herein.

In an embodiment, I/O interface 235 provides an interface between one or more components of system 200 and one or more input and/or output devices. Example input devices include, without limitation, sensors, keyboards, touch screens or other touch-sensitive devices, biometric sensing devices, computer mice, trackballs, pen-based pointing devices, and/or the like. Examples of output devices include, without limitation, other processing devices, cathode ray tubes (CRTs), plasma displays, light-emitting diode (LED) displays, liquid crystal displays (LCDs), printers, vacuum fluorescent displays (VFDs), surface-conduction electron-emitter displays (SEDs), field emission displays (FEDs), and/or the like. In some cases, an input and output device may be combined, such as in the case of a touch panel display (e.g., in a smartphone, tablet, or other mobile device).

In an embodiment, I/O interface 235 provides an interface to a camera (not shown). For example, system 200 may be a mobile device, such as a smartphone, tablet computer, or laptop computer, with one or more integrated cameras (e.g., rear and front facing cameras). Alternatively, system 200 may be a desktop or other computing device that is connected via I/O interface 235 to an external camera. In either case, the camera captures images (e.g., photographs, video, etc.) for processing by processor(s) 210 (e.g., executing the disclosed software) and/or storage in main memory 215 and/or secondary memory 220.

System 200 may also include optional wireless communication components that facilitate wireless communication over a voice network and/or a data network (e.g., in the case of user system 130). The wireless communication components comprise an antenna system 270, a radio system 265, and a baseband system 260. In system 200, radio frequency (RF) signals are transmitted and received over the air by antenna system 270 under the management of radio system 265.

In an embodiment, antenna system 270 may comprise one or more antennae and one or more multiplexors (not shown) that perform a switching function to provide antenna system 270 with transmit and receive signal paths. In the receive path, received RF signals can be coupled from a multiplexor to a low noise amplifier (not shown) that amplifies the received RF signal and sends the amplified signal to radio system 265.

In an alternative embodiment, radio system 265 may comprise one or more radios that are configured to communicate over various frequencies. In an embodiment, radio system 265 may combine a demodulator (not shown) and modulator (not shown) in one integrated circuit (IC). The demodulator and modulator can also be separate components. In the incoming path, the demodulator strips away the RF carrier signal leaving a baseband receive audio signal, which is sent from radio system 265 to baseband system 260.

If the received signal contains audio information, then baseband system 260 decodes the signal and converts it to an analog signal. Then the signal is amplified and sent to a speaker. Baseband system 260 also receives analog audio signals from a microphone. These analog audio signals are converted to digital signals and encoded by baseband system 260. Baseband system 260 also encodes the digital signals for transmission and generates a baseband transmit audio signal that is routed to the modulator portion of radio system 265. The modulator mixes the baseband transmit audio signal with an RF carrier signal, generating an RF transmit signal that is routed to antenna system 270 and may pass through a power amplifier (not shown). The power amplifier amplifies the RF transmit signal and routes it to antenna system 270, where the signal is switched to the antenna port for transmission.

Baseband system 260 is also communicatively coupled with processor 210, which may be a central processing unit (CPU). Processor 210 has access to data storage areas 215 and 220. Processor 210 is preferably configured to execute instructions (i.e., computer programs, such as the disclosed application, or software modules) that can be stored in main memory 215 or secondary memory 220. Computer programs can also be received from baseband system 260 and stored in main memory 215 or in secondary memory 220, or executed upon receipt. Such computer programs, when executed, enable system 200 to perform the various functions of the disclosed embodiments.

1.3. Exemplary Systems

Embodiments in which document recognition is performed by user system 130 may be referred to as “client-side” embodiments, whereas embodiments in which document recognition is performed by platform 110 may be referred to as “server-side” embodiments. In an alternative embodiment, the document-recognition process could itself be split between user system 130 and platform 110, with some functions of the document-recognition process performed by user system 130 and some functions of the document-recognition process performed by platform 110. In any case, the disclosed document-recognition process may be performed by a document-recognition module that is implemented as one or more executable software modules. In an embodiment, the document-recognition process comprises both determining the location or position of a document in an input image and classifying the document into a particular type of document (e.g., a particular type of identity document). This is what is meant by the terms “location” and “classification,” as used herein.

In the client-side embodiment, the document-recognition module is hosted and executed on a user system 130 (e.g., as client application 132). In this case, user system 130 may comprise system 200. The document-recognition module may be stored persistently in secondary memory 220, and loaded into main memory 215 to be executed by processor(s) 210 of user system 130. Updates of the document-recognition module may be automatically or manually downloaded from platform 110, periodically or as needed, when user system 130 has a connection to platform 110 via network(s) 120. Alternatively, the document-recognition module may be updated by other means or not at all. In either case, user system 130 may be a mobile device, such as a smartphone, laptop computer, or tablet computer, with an integral or connected camera or dedicated image scanner. For example, in a typical client-side embodiment, an input image of a document may be captured using a camera of the mobile device, with the document recognition performed directly by one or more processors 210 on the mobile device.

Preferably, the time required to perform document recognition in the client-side embodiment should not exceed one second. See, e.g., “High-speed OCR algorithm for portable passport readers,” Bessmeltsev et al., 21st Int'l Conference on Computer Graphics and Vision, GraphiCon'2011—Conference Proceedings, 2011, which is hereby incorporated herein by reference as if set forth in full. Mobile recognition systems are considered “real-time” if they have processing rates of more than two frames per second. See, e.g., “Real-Time Mobile Facial Expression Recognition System—A Case Study,” Suk et al., 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 132-7, June 2014, which is hereby incorporated herein by reference as if set forth in full. Thus, in a preferred implementation, the client-side embodiment would be able to recognize the document in an image in under half a second.

In the alternative, server-side embodiment, the document-recognition module is hosted and executed on platform 110 (e.g., as server application 112). In this case, platform 110 may comprise system 200 as a server system. Again, the document-recognition module may be stored persistently in secondary memory 220, and loaded into main memory 215 to be executed by processor(s) 210 of platform 110. Images may be uploaded from user system(s) 130, through network(s) 120, to platform 110 for processing by the document-recognition module. For example, in a typical server-side embodiment, an input image that has been captured at a user system 130 (e.g., scanned, photographed, or otherwise sensed) is transmitted to remote platform 110 for analysis.

Preferably, the time required to perform document recognition in the server-side embodiment should not exceed the time required to scan the document. Modern sheet-fed scanners have processing rates ranging from 20-30 pages per minute for lightweight models (e.g., Canon™ imageFORMULA DR-C, Kodak™ i1, etc.) to 200 or more pages per minute for heavy-duty models (e.g., Canon™ imageFORMULA DR-G, Kodak™ Alaris i5 series, etc.), with an average processing rate of 60 pages per minute.

An example of input characteristics and requirements for the client-side and server-side embodiments, described above, is illustrated in Table 1 below:

TABLE 1

                            Client-side Embodiment          Server-side Embodiment

Input Image                 Image frame from video          Photograph or scan
                            captured by mobile device

Document Position           All document boundaries are     Up to two boundaries may be
                            visible in input image          absent from input image (e.g.,
                                                            when document is positioned in
                                                            a corner of the input image)

No. of Pixels/              0.5-8.0 megapixels (e.g.,       Scan of 150-600 pixels per
Pixel Density               Ultra High Definition 4K)       inch, or photograph of
                                                            2.0-15.0 megapixels

Minimum Character Height    10 pixels                       10 pixels

Processing Time Limit       0.5 seconds (benchmark:         1 second (benchmark: AMD™
                            Exynos™ 7420, Apple™ A8)        FX-8350, Intel™ Core i7 4470S)

It should be understood that the example requirements in Table 1 are non-limiting. In other words, other characteristics and requirements are possible, and the above characteristics and requirements simply represent typical objectives. Thus, Table 1 is merely used as an example of parameters that the disclosed document-recognition module is capable of satisfying.

In addition, in client-side embodiments in which the document-recognition module is part of a larger application, other functions of the larger application may be performed client-side and/or server-side. Similarly, in server-side embodiments in which the document-recognition module is part of a larger application, other functions of the larger application may be performed server-side and/or client-side. As one example of a client-side embodiment, the document-recognition module may be executed client-side on user system 130 (e.g., a mobile device, such as a smartphone), whereas other modules, representing the remainder or other portion of the application, may be executed server-side on platform 110. In this case, the result of the document recognition or a result of additional processing based on the result of the document recognition, performed on user system 130, may be transmitted from user system 130 to platform 110 via network(s) 120. Then, an overall result of the application may be determined by platform 110, and transmitted from platform 110 to user system 130 and/or an external system 140 via network(s) 120.

Preferred embodiments of the document-recognition module are both scalable and trainable. For scalability, the document-recognition module should be capable of simultaneously supporting hundreds of different document types. For example, there are more than one hundred different templates of drivers licenses in the United States, since there are fifty states and each state has two or more templates. In an embodiment, the document-recognition module, whether client-side or server-side, is capable of recognizing all of these different document types (i.e., the particular template of the state which issued the drivers license in the image). Additionally or alternatively, the document-recognition module may be configured to identify other types of documents (e.g., other identity documents, such as passports, employee identification cards, etc., and/or other non-identity documents).

For trainability, the document-recognition module should not require a large training dataset. This is particularly true for a document-recognition module that is intended to recognize identity documents (e.g., drivers licenses, passports, etc.), since samples of identity documents are generally not published or otherwise openly available in large numbers due to legal restrictions and security concerns.

2. Process Overview

Embodiments of processes for efficient recognition of documents in images will now be described in detail. It should be understood that the described processes may be embodied in one or more software modules that are executed by one or more hardware processors (e.g., processor 210), for example, as the application discussed herein (e.g., server application 112, client application 132, and/or a distributed application comprising both server application 112 and client application 132), which may be executed wholly by processor(s) of platform 110, wholly by processor(s) of user system(s) 130, or may be distributed across platform 110 and user system(s) 130, such that some portions or modules of the application are executed by platform 110 and other portions or modules of the application are executed by user system(s) 130. The described processes may be implemented as instructions represented in source code, object code, and/or machine code. These instructions may be executed directly by the hardware processor(s), or alternatively, may be executed by a virtual machine operating between the object code and the hardware processors. In addition, the disclosed application may be built upon or interfaced with one or more existing systems.

Alternatively, the described processes may be implemented as a hardware component (e.g., general-purpose processor, integrated circuit (IC), application-specific integrated circuit (ASIC), digital signal processor (DSP), field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, etc.), combination of hardware components, or combination of hardware and software components. To clearly illustrate the interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps are described herein generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled persons can implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the invention. In addition, the grouping of functions within a component, block, module, circuit, or step is for ease of description. Specific functions or steps can be moved from one component, block, module, circuit, or step to another without departing from the invention.

Furthermore, while the processes, described herein, are illustrated with a certain arrangement and ordering of steps, each process may be implemented with fewer, more, or different steps and a different arrangement and/or ordering of steps. In addition, it should be understood that any step, which does not depend on the completion of another step, may be executed before, after, or in parallel with that other independent step, even if the steps are described or illustrated in a particular order.

2.1. Introduction

FIG. 3 illustrates a process 300 in which document recognition is performed, according to an embodiment. Process 300 may be performed client-side by user system 130 (e.g., as client application 132 executing on a mobile device) or server-side by platform 110 (e.g., as server application 112). Alternatively, process 300 may be a distributed process with some subprocesses, such as subprocesses 310 and 320, being performed client-side by user system 130 and other subprocesses, such as subprocesses 330 and 340, being performed server-side by platform 110. It should be understood that subprocess 320 represents the disclosed document-recognition process, which may be implemented by the document-recognition module described herein, which may form part of the larger application that implements process 300. Alternatively, subprocess 320 could be a stand-alone process by itself or in combination with subprocess 310.

In subprocess 310, an input image is received. Subprocess 310 may comprise capturing the input image using a digital camera (e.g., a small-scale digital camera integrated into or connected to user system 130) or hardware scanner. In the case of a hardware scanner, the scanned input image may have a resolution from 150 pixels per inch (PPI) to 600 PPI or higher. In an alternative embodiment, subprocess 310 may comprise receiving a previously captured input image (e.g., captured earlier in time and/or by another device). In either case, the input image may comprise a stand-alone photograph or an image frame from a video stream.

In subprocess 320, a document, represented in the input image, is recognized, according to the document-recognition process described herein. During subprocess 320, the input image may also be pre-processed and/or post-processed. For example, once the document-recognition module identifies the location of the document in the input image, a portion of the input image that does not contain the document may be cropped out, such that little or no background remains in the output image. In general, the position (e.g., location and orientation) of the document in the input image, including the angular rotation of the document, could be arbitrary. In addition, an input image that has been captured by a camera of a mobile device may include highlights and other lighting variations, and the document in the input image may have projective distortions.

In subprocess 330, the output of the document recognition in subprocess 320 may be utilized in one or more additional subprocesses to produce a result that is output in subprocess 340. Subprocess 330 may utilize the position of the document and the type of the document, recognized in subprocess 320, to extract data from the image of the document. For example, this data extraction may comprise applying optical character recognition (OCR) to text fields within the boundaries of the located document in order to extract text from the document, extracting images within the boundaries of the document (e.g., headshots of a person identified by the document), decoding a barcode (e.g., one-dimensional barcode, matrix barcode, such as a Quick Response (QR) code, etc.) and/or other type of code within the boundaries of the document to produce corresponding character strings, and/or the like.
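By way of non-limiting illustration, once the document's quadrangle is known, the document may be rectified and its contents decoded, for example as sketched below. The quadrangle is assumed to contain four vertices in order, pytesseract is assumed to be available as an external OCR package, and OpenCV's QR detector handles the barcode case:

```python
import cv2
import numpy as np
import pytesseract  # assumed external OCR package wrapping Tesseract

def extract_document_data(input_img, quad, out_size=(1000, 630)):
    # Warp the located document quadrangle to a fronto-parallel image.
    src = np.float32(quad)
    w, h = out_size
    dst = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    M = cv2.getPerspectiveTransform(src, dst)
    doc = cv2.warpPerspective(input_img, M, (w, h))
    # OCR the rectified document and decode any QR code within it.
    text = pytesseract.image_to_string(doc)
    qr_data, _, _ = cv2.QRCodeDetector().detectAndDecode(doc)
    return doc, text, qr_data
```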

While the document recognition in subprocess 320 could be performed as a stand-alone function, it is most beneficial in the context of a larger process, such as process 300. Despite the widespread development of text-in-the-wild methods, as described, for example, in “Scene Text Detection and Recognition: The Deep Learning Era,” Long et al., arXiv:1811.04256, 2018, which is hereby incorporated herein by reference as if set forth in full, it is more efficient, in terms of computational performance, to locate a document prior to text recognition (e.g., OCR). Thus, the disclosed process for detecting the precise coordinates of document boundaries in subprocess 320 can greatly benefit a text recognition process in an embodiment of subprocess 330.

In order to find the coordinates of the document's boundaries, it is generally sufficient to estimate the distortion parameters in the form of a homography matrix. In more specific cases, an affine transformation matrix or other matrix may be used. Notably, identity documents are sometimes characterized by text fields with fixed positions. Thus, knowledge of the type of document in the input image and the distortion parameters enables subprocess 330 to identify zones in the input image that represent text fields, without extra computation.
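For example, given the estimated homography H that maps the input image Q to the template image T, a fixed text-field rectangle defined in template coordinates can be projected back into the input image with little more than a matrix inverse (a sketch assuming OpenCV; the (x, y, w, h) rectangle format is an assumption):

```python
import cv2
import numpy as np

def field_zone_in_input(H, field_rect):
    # Project a fixed text-field rectangle from template coordinates
    # back into the input image, using the inverse of the homography
    # H : Q -> T estimated during recognition.
    x, y, w, h = field_rect
    corners = np.float32([[x, y], [x + w, y], [x + w, y + h], [x, y + h]])
    H_inv = np.linalg.inv(H)
    return cv2.perspectiveTransform(corners.reshape(-1, 1, 2), H_inv).reshape(-1, 2)
```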

2.2. Document Recognition

In an embodiment, document recognition, which may correspond to subprocess 320 in FIG. 3, comprises identifying the position (e.g., location and/or orientation) of a document in an input image, identifying the type of document in the input image, and/or determining the projective distortion parameters for the document in the input image. The disclosed document-recognition process is universal in that it may be used for both client-side and server-side embodiments.

To start, “Complex Document Classification and Localization Application on Identity Document Images,” Awal et al., 14th Int'l Association for Pattern Recognition (IAPR) Int'l Conference on Document Analysis and Recognition (ICDAR), IEEE, vol. 1, pp. 426-31, 2017, which is hereby incorporated herein by reference as if set forth in full, describes a multitude of approaches for classifying documents. Of these, the methods based on visual document representation are the most suited for classifying identity documents into different types. Such methods include a number of neural-network based methods, as described in “Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval,” Harley et al., 13th ICDAR, IEEE, pp. 991-5, 2015, and “Efficient Character-level Document Classification by Combining Convolution and Recurrent Layers,” Xiao et al., arXiv:1602.00367, 2016, which are both hereby incorporated herein by reference as if set forth in full. However, high classification accuracy requires architectures with fully connected layers containing large amounts of weights, and this requires large training datasets. As mentioned above, in the case of document recognition for identity documents, large training datasets of the size required for such neural networks are not generally available. Other methods include a combination of statistical learning with methods of data synthesis and augmentation.

However, there exists a classification approach that is based on pairwise image matching. In this approach, a compact representation is computed for images of documents, which is robust against certain distortions. A compact representation of an image that uses a set of keypoints and descriptors associated with those keypoints has advantages over representations that use global descriptors or a set of local descriptors without spatial information (referred to as a “bag of local features”). An example of a representation that uses a set of keypoints and associated descriptors is described in “Object Recognition from Local Scale-Invariant Features,” Lowe, Int'l Conference on Computer Vision (ICCV), IEEE, p. 1150, 1999, which is hereby incorporated herein by reference as if set forth in full. An example of a representation that uses global descriptors is described in “Fine-grained classification of identity document types with only one example,” Rodner et al., 14th IAPR Int'l Conference on Machine Vision Applications (MVA), IEEE, pp. 126-9, 2015, which is hereby incorporated herein by reference as if set forth in full. An example of a representation that uses a set of local descriptors without spatial information is described in “Document Image Retrieval with Local Feature Sequences,” Li et al., 10th ICDAR, IEEE, pp. 346-50, 2009, which is hereby incorporated herein by reference as if set forth in full.

Due to its advantages over other methods, an embodiment of the disclosed document-recognition process utilizes keypoints and associated descriptors to represent documents in images. This pairwise set of keypoints and associated descriptors may also be referred to as a “constellation of features.” The constellation of features for an image comprises a set of local features and information about those local features' mutual spatial relationships. This constellation-of-features model is more robust against inter-class collisions than the bag-of-local-features model. However, the general mapping of the constellation of features to a metric space is not trivial to define, thereby restricting the usage of data structures for fast nearest-neighbor searches.

“Better matching with fewer features: The selection of useful features in large database recognition problems,” Turcot et al., 12th ICCV Workshops, IEEE, pp. 2109-16, 2009, which is hereby incorporated herein by reference as if set forth in full, describes a two-step scheme that combines the advantages of the constellation-of-features model with the bag-of-local-features model. In the first step, an approximate nearest-neighbor search is performed in a reference database using the bag-of-local-features model. In the second step, the geometric correspondence between the input image and each of the candidates selected in the first step is estimated. This two-step scheme has high classification accuracy for identity documents in scanned input images and in mixed datasets.

“Semi-structured document image matching and recognition,” Augereau et al., Document Recognition and Retrieval XX, vol. 8658, Int'l Society for Optics and Photonics, p. 865804, 2013, which is hereby incorporated herein by reference as if set forth in full, illustrates that filtering the false correspondences of local features and checking that the obtained solution is well-conditioned dramatically increases classification accuracy. This filtering requires several samples of each document type.

Awal et al. describes a generalization of this approach that allows for projective distortion of the document in the input image. Awal et al. also features a filtering method based on a single sample of each document type. This filtering method comprises identifying areas in the sample image that contain variable data, and excluding features extracted from those areas.

The authors of both Augereau et al. and Awal et al. exclusively considered the classification accuracy of their algorithms. However, in addition to classifying a document, the disclosed document-recognition module may determine the position of a document's boundaries. Furthermore, the speed and accuracy of the disclosed approach can be further improved by combining it with methods for detecting geometric primitives, such as lines and quadrangles, as described, for example, in “Document localization algorithms based on feature points and straight lines,” Skoryukina et al., 10th Int'l Conference on Machine Vision (ICMV), Int'l Society for Optics and Photonics, vol. 10696, p. 106961H, 2018, which is hereby incorporated herein by reference as if set forth in full.

The determination of the type of the document in an input image and the determination of the distortion parameters of that document can be performed independently from each other or within a single process. The document in an input image may be located using methods for extracting boundary elements, segments, and/or lines. A quadrangle may then be constructed using these detected geometric primitives by traversing the intersections graph, as discussed, for example, in “Segments Graph-Based Approach for Document Capture in a Smartphone Video Stream,” Zhukovsky et al., 14th IAPR Int'l Conference on Document Analysis and Recognition, IEEE, vol. 1, pp. 337-342, 2017, which is hereby incorporated herein by reference as if set forth in full, or by searching through alternatives using a system of penalties and heuristic pruning in various stages, as discussed, for example, in “Real time rectangular document detection on mobile devices,” Skoryukina et al., 7th Int'l Conference on Machine Vision, Int'l Society for Optics and Photonics, vol. 9445, p. 94452A, 2015, which is hereby incorporated herein by reference as if set forth in full.

Image classes (i.e., representing different types of documents) may be defined for N document types as follows:

C = {C_(i)}_(i∈[0,N]),

wherein C_(i) is the class of images with the i-th document type for i∈[1, N]. Class C₀ may be defined as a class of other images (e.g., for images that cannot be classified as another document type).

The goal of document recognition is to determine the class C_(i) for a given query or input image Q. For each class C_(i), i∈[1, N], a template image T_(i), representing an ideal image of a document, is obtained. An ideal image of a document is an image in which the document has been captured under ideal circumstances, including ideal lighting (e.g., uniform lighting, no bright spots, proper exposure, etc.), in focus (e.g., high contrast, etc.), with no distortions (e.g., no warping, no skew, etc.), and with nothing obscuring the document. At the very least, each template image T should be generated from an ideal image of a document that has been captured in uniform lighting conditions without projective distortions. For example, the ideal images to be used as template images T may be obtained by using a flatbed scanner to capture an image of a document, which represents the prototype for a particular class or type of document, and then cropping the captured image to the boundaries of the document to remove any background.

If the determined class C_(i) is associated with a template image T_(i), the transformation H:Q→T_(i) can be estimated to map points in the input image Q to corresponding points in the template image T_(i). To estimate this transformation, a family of projective transformations may be used. A projective transformation, as described by a pinhole camera model, may be expressed as a 3×3 matrix operator.
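For illustration, the estimation and application of such a 3×3 operator may be sketched in Python with OpenCV; the point coordinates below are hypothetical, and cv2.findHomography merely stands in for whatever estimator an embodiment uses:

    import numpy as np
    import cv2

    # Corresponding points: document corners detected in the input image Q and
    # the corners of the template image T (hypothetical values).
    points_q = np.array([[102, 87], [941, 64], [978, 652], [83, 690]],
                        dtype=np.float32)
    points_t = np.array([[0, 0], [900, 0], [900, 600], [0, 600]],
                        dtype=np.float32)

    # H is the 3x3 projective operator mapping points in Q to points in T.
    H, _ = cv2.findHomography(points_q, points_t)

    # Applying H to a point p in Q (in homogeneous coordinates):
    p = np.array([102.0, 87.0, 1.0])
    mapped = H @ p
    mapped /= mapped[2]            # normalize the homogeneous coordinate
    print(mapped[:2])              # approximately (0, 0), T's top-left corner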

In an embodiment, the document-recognition module uses the Speeded-Up Robust Features (SURF) algorithm, as described in Awal et al., to select the centers of informative regions (i.e., keypoints) in the input image Q and in each template image T_(i). The neighborhoods of these keypoints are encoded using local metric descriptors. Each keypoint p and its descriptor f in the input image Q is matched with keypoints, from template images T_(i), whose descriptors are closest to the point p's descriptor f in terms of the Euclidean metric (also referred to as “Euclidean distance”). To speed up the matching process, the keypoints and descriptors of the template images T_(i) in the reference database may be indexed. Specifically, for each template image T_(i), each keypoint p with its associated descriptor f may be placed in a hierarchical clustering tree or randomized k-d tree.
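A sketch of this baseline scheme follows, using SIFT as a readily available stand-in for SURF (which requires the opencv-contrib build) and FLANN's randomized k-d trees for the Euclidean nearest-neighbor search; the file paths and ratio-test constant are hypothetical:

    import cv2

    detector = cv2.SIFT_create()  # stand-in for SURF

    img_q = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)
    img_t = cv2.imread("template.jpg", cv2.IMREAD_GRAYSCALE)

    kp_q, desc_q = detector.detectAndCompute(img_q, None)
    kp_t, desc_t = detector.detectAndCompute(img_t, None)

    # Index the template descriptors with randomized k-d trees (FLANN), then
    # match each query descriptor to its nearest neighbors under the
    # Euclidean metric.
    matcher = cv2.FlannBasedMatcher(dict(algorithm=1, trees=4),  # 1 = KDTREE
                                    dict(checks=64))
    matches = matcher.knnMatch(desc_q, desc_t, k=2)

    # Lowe's ratio test discards ambiguous correspondences.
    good = [m for m, n in matches if m.distance < 0.7 * n.distance]
    print(len(good), "tentative keypoint matches")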

Each template image T_(i), having keypoints matching the keypoints in input image Q, may be assigned a rank r. Rank r may be calculated as the number of keypoints, classified as neighbors of descriptor f, divided by the total number of keypoints in the template image T_(i). However, alternative methods may be used to calculate the rank r for each template image T_(i).

Geometric validations may be performed for those template images T_(i) with the highest rank r. This geometric validation may comprise, for each template image T_(i) that is a candidate to match input image Q, calculating a Random Sample Consensus (RANSAC) transformation matrix that maps keypoints in input image Q to keypoints in the template image T_(i). A pair of keypoints with close descriptors is considered to be a geometrically valid match if the RANSAC transformation matrix maps the pair of keypoints to each other within a predefined accuracy threshold. In the context of documents, the transformation will be considered geometrically valid if the corners of the documents in the images define a quadrangle that complies with a set of predefined conditions.
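A minimal sketch of this validation step; the matched coordinates and the 5-pixel reprojection threshold are hypothetical:

    import numpy as np
    import cv2

    # Matched keypoint coordinates in Q and in a candidate template T, e.g.,
    # as produced by a descriptor matcher (hypothetical values).
    src = np.float32([[102, 87], [430, 120], [618, 333], [95, 540],
                      [700, 80], [350, 410]]).reshape(-1, 1, 2)
    dst = np.float32([[0, 0], [320, 40], [510, 260], [10, 470],
                      [600, 5], [260, 350]]).reshape(-1, 1, 2)

    # RANSAC estimates a homography and flags the matches that agree with it
    # (inliers) within the reprojection threshold.
    H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC,
                                        ransacReprojThreshold=5.0)

    n_inliers = int(inlier_mask.sum()) if inlier_mask is not None else 0
    print(n_inliers, "of", len(src), "matches are geometrically valid")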

One template image T may be selected, from the set of possible candidates, based on the ranking and geometric validation of all of the template images T in the set of possible candidates. For example, the template image T with the highest number of matching keypoints with geometrically valid transformations may be selected. The document type associated with the selected template image T is then identified as the document type of the document in input image Q.

The above combination of the constellation-of-features model with approximate nearest-neighbor searches has demonstrated high accuracy when applied to document classification. However, there are a number of problems with this approach:

-   For some types of identity documents, the positioning or structure of static elements does not allow a well-conditioned transformation matrix H to be constructed.
-   The differences in image sizes between input images Q may be up to 400% in some implementations of the server-side embodiment.
-   The SURF algorithm in Awal et al. determines the scale of input image Q before local feature extraction, but research on robustness demonstrates that the percentage of correctly matched input images with changing scale is 50% or less. See, e.g., “A comparison of SIFT, PCA-SIFT and SURF,” Juan et al., IJIP, vol. 3, no. 4, pp. 143-52, 2009, which is hereby incorporated herein by reference as if set forth in full.
-   The speed of this approach is generally not sufficient for mobile devices. For example, this approach cannot satisfy the requirements in Table 1 for the client-side embodiment. See, e.g., “Mobile Image Analysis: Android vs. iOS,” Cobâszan et al., Int'l Conference on Multimedia Modeling, Springer, pp. 99-110, 2015, which is hereby incorporated herein by reference as if set forth in full.

The above approach utilizes “points” as local features. In an embodiment, the disclosed document recognition improves on this approach by utilizing document boundaries and content structures (e.g., photographs, tables, and/or the like in the document) to yield other types of local features, such as lines (e.g., full lines or line segments) and/or quadrangles. The extraction of these higher-level features enables the document-recognition module to compensate for projective distortions, with the possible exception of scale (see, e.g., “Robust Perspective Rectification of Camera-Captured Document Images,” Takezawa et al., 14th IAPR ICDAR, IEEE, vol. 6, pp. 27-32, 2017, which is hereby incorporated herein by reference as if set forth in full) and 90° rotation.

In light of the problems discussed above, in an embodiment, the document-recognition module uses less informative keypoint detectors and descriptors in combination with fast methods of locating geometric primitives, such as lines and quadrangles. Once the geometric primitives are located, they may be used to obtain more relevant descriptor values and to validate the geometric correctness of the result.

FIG. 4 illustrates a document-recognition process, according to an embodiment. In subprocesses 405-420, a plurality of template images T are converted into compact representations that are stored in a reference database and, in subprocesses 425-460, which correspond to the disclosed document-recognition module, an input image Q is matched to the reference database to identify a document type and one or more distortion parameters for a document in the input image Q. It should be understood that subprocesses 405-420 may be performed serially or in parallel for each template image T_(i), and subprocesses 425-460 may be performed serially or in parallel for each input image Q.

2.2.1. Processing of Template Images

In subprocess 405, template images T_(i), representing ideal images of documents, are normalized. For each template image T, this normalization may comprise scaling the width of the template image T to a standard value while preserving the aspect ratio, and smoothing the template image T using an edge-preserving blur (e.g., a bilateral filter). This normalizing pre-processing enhances keypoint detection, and simplifies scale estimation, since images with the same aspect ratio will have the same size.
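A minimal sketch of this normalization; the standard width and filter parameters are hypothetical choices:

    import cv2

    def normalize_template(img, standard_width=500):
        h, w = img.shape[:2]
        scale = standard_width / float(w)
        resized = cv2.resize(img, (standard_width, int(round(h * scale))),
                             interpolation=cv2.INTER_AREA)
        # A bilateral filter smooths noise while keeping document edges sharp,
        # which benefits subsequent keypoint detection.
        return cv2.bilateralFilter(resized, d=7, sigmaColor=50, sigmaSpace=50)

    template = cv2.imread("template.jpg")  # hypothetical path
    normalized = normalize_template(template)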

In subprocess 410, features (e.g., keypoints and descriptors) are extracted from each template image T. In an embodiment, the Yet Another Contrast-Invariant Point Extractor (YACIPE) algorithm is used for keypoint detection, due to its computational performance. However, it should be understood that other keypoint-detection algorithms may be used. The YACIPE algorithm is described, for example, in “Modification of YAPE keypoint detection algorithm for wide local contrast range images,” Lukoyanov et al., 10th ICMV, Int'l Society for Optics and Photonics, vol. 10696, p. 1069616, 2018, which is hereby incorporated herein by reference as if set forth in full. The YACIPE algorithm represents each keypoint using coordinates and a score (e.g., ⟨x, y, YACIPE score⟩). The neighborhood size and orientation for each keypoint are not computed. For each keypoint, receptive field descriptors (RFDs) are calculated for an image region (e.g., a 32×32 pixel region) around the keypoint to produce a feature vector. This calculation is described, in an example, in “Receptive Fields Selection for Binary Feature Description,” Fan et al., IEEE Transactions on Image Processing, vol. 23, no. 6, pp. 2583-95, 2014, which is hereby incorporated herein by reference as if set forth in full. In an embodiment, the resulting feature vector for each keypoint comprises 297 binary features. In the constellation-of-features model of an embodiment, an image I (e.g., an input image Q or template image T) can be represented as follows:

ω = W(I) = {⟨p_(i), f_(i)⟩}_(i∈[1,M]),

wherein p_(i)=(x_(i), y_(i)), representing the coordinates of the i-th keypoint in the image I, wherein f_(i) is the descriptor of the neighborhood of the i-th keypoint in the image I, and wherein M is the number of keypoints detected in the represented image I.
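The representation ω may be captured by a simple data structure; the field names below are illustrative, not taken from the source:

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Feature:
        """One entry <p_i, f_i> of the constellation: keypoint coordinates,
        the binary descriptor of its neighborhood, and the detector score
        (used later for filtering)."""
        x: float
        y: float
        descriptor: np.ndarray  # e.g., 297 binary features packed into bytes
        score: float

    # omega = W(I): the compact representation of an image is the list of its
    # M keypoint/descriptor pairs.
    Constellation = list[Feature]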

In subprocess 415, the features extracted in subprocess 410 are filtered. In other words, some of the extracted features may be discarded. For example, as described in Awal et al., zones with variable data (e.g., text or images that will vary across different instances of documents of the same type) may be selected in each template image T. All keypoints in these zones of variable data in the template image T are discarded when calculating the compact representation of the template T, i.e., ω_(i)=W(T). Examples of such zones are depicted by the highlighted portions in the template image T in FIG. 5A. In an embodiment, the remaining keypoints are sorted in descending order of YACIPE score, and the M keypoints with the highest YACIPE scores are stored as the compact representation of the template image T in the reference database in subprocess 420 to be used for feature matching. In subprocess 420, the features in each compact representation may be stored in the reference database in an index tree (e.g., a hierarchical clustering tree, a randomized k-d tree, etc.).
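Using the Feature type sketched above, subprocess 415 may look as follows; variable_zones is a hypothetical list of rectangles marking variable data, and the cap of 450 keypoints echoes the template-side limit used in the evaluation below:

    def filter_features(features, variable_zones, max_keypoints=450):
        def in_variable_zone(f):
            return any(x0 <= f.x <= x1 and y0 <= f.y <= y1
                       for (x0, y0, x1, y1) in variable_zones)

        # Discard keypoints that fall inside zones of variable data...
        kept = [f for f in features if not in_variable_zone(f)]
        # ...then keep only the strongest detections, by descending score.
        kept.sort(key=lambda f: f.score, reverse=True)
        return kept[:max_keypoints]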

The RFDs are robust to angular rotations of up to 15°. In addition, the use of lines (e.g., full lines and line segments) in the image enables the document-recognition module to determine any angular position of a document, with the exception of a 90° rotation. Thus, to fully account for the classification of rotated documents, angular rotations of 0°, 90°, 180°, and 270° should be addressed. It is more computationally efficient to account for these rotations in subprocesses 405-420 for template images T_(i), since these subprocesses can be performed offline. Accordingly, in an embodiment, a separate template image T may be obtained for each type of document at each of the 0°, 90°, 180°, and 270° rotations. Consequently, each template image T will have four compact representations stored in the reference database: a first compact representation of an ideal image of the document rotated at 0°, a second compact representation of the ideal image of the document rotated at 90°, a third compact representation of the ideal image of the document rotated at 180°, and a fourth compact representation of the ideal image of the document rotated at 270°. It should be understood that a match of an input image Q to any of these four compact representations in the reference database will result in a match to the same associated document type.
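One way to obtain the four rotated representations offline is simply to rotate the ideal image in 90° steps and re-run feature extraction; extract_features below is a stand-in for subprocesses 410-415:

    import numpy as np

    def build_rotated_representations(ideal_image, extract_features):
        representations = {}
        for quarter_turns in range(4):          # 0, 90, 180, 270 degrees
            rotated = np.rot90(ideal_image, k=quarter_turns)
            representations[quarter_turns * 90] = extract_features(rotated)
        return representations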

A trivial matching process for each descriptor in an input image Q to each descriptor of a template image T will lead to linear dependence on the number of template images T_(i). Thus, in an embodiment, compact representations may be processed as described in “Fast Matching of Binary Features,” Muja et al., 9th Conference on Computer and Robot Vision, IEEE, pp. 404-10, 2012, which is hereby incorporated herein by reference as if set forth in full. In particular, for each point j∈[1, |ω_(i)|] in each compact representation of an image, an entry ⟨i, f_(j)^(i)⟩ is added into a data structure which enables a nearest-neighbor search, such as a hierarchical clustering tree. Then, in order to expedite the matching process, the descriptor of each entry of the compact representation of the input image Q is searched in this data structure, and the list of its nearest neighbors is obtained from among the entries of the compact representations of the template images T_(i). Based on the frequency of template images T_(i) having a neighboring entry to the entry of the input image Q, the list of template images T_(i) may be pruned, thereby constraining the candidate template images T_(i) during matching.
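A simplified stand-in for this scheme, with a brute-force Hamming scan in place of the hierarchical clustering tree of Muja et al. (the tree changes the search cost, not the results):

    import numpy as np

    # Precomputed popcount table for 8-bit values, used for Hamming distances.
    POPCOUNT = np.unpackbits(np.arange(256, dtype=np.uint8)[:, None],
                             axis=1).sum(axis=1)

    def build_index(templates):
        """templates: {template_id: uint8 array of shape (n_keypoints,
        n_bytes)}, i.e., each row is one packed binary descriptor f_j^i."""
        ids, descs = [], []
        for tid, d in templates.items():
            ids.extend([tid] * len(d))
            descs.append(d)
        return np.array(ids), np.vstack(descs)

    def nearest_neighbors(index, query_desc, k=5):
        ids, descs = index
        dists = POPCOUNT[descs ^ query_desc].sum(axis=1)  # Hamming distances
        order = np.argsort(dists)[:k]
        return ids[order], dists[order]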

2.2.2. Matching of Input Images

In an embodiment, each input image Q is analyzed at least twice: (1) to locate lines and quadrangles; and (2) to perform keypoint analysis. The first analysis is represented by subprocess 425, whereas the second analysis is represented by subprocess 435.

The particular method used for locating lines and quadrangles may depend on the specifics of the document-recognition module, such as whether the document-recognition module is intended as a client-side or server-side embodiment. For example, the algorithm for fast quadrangle detection in Skoryukina et al. satisfies the example requirements of the client-side embodiment specified in Table 1, and therefore, may be used in an embodiment of subprocess 425. Restrictions on the size of the document in input image Q (e.g., an image frame) enable the algorithm to set regions of interest (ROIs) for each side of the document and perform detection of each side independently. Subprocess 425 may comprise calculating a grayscale boundaries map for each ROI, and subsequently performing the following procedure, iteratively, on each boundaries map (a simplified sketch of this loop follows the list below):

-   (1) Apply the Fast Hough Transform to the boundaries map. The Hough Transform is described, for example, in “Point-to-line mappings as Hough Transforms,” Bhattacharya et al., Pattern Recognition Letters, vol. 23, no. 14, pp. 1705-10, 2002, and U.S. Pat. No. 3,069,654, issued Dec. 18, 1962, which are both hereby incorporated herein by reference as if set forth in full.
-   (2) Identify the cell with the highest value in the Hough parameter space as defining the candidate line and its weight.
-   (3) If the number of obtained candidate lines is not sufficient, erase the boundaries in the boundaries map in the neighborhood of the candidate line, and perform another iteration of (1)-(3).
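A minimal sketch of this iterative loop, assuming a binary boundaries map; a plain accumulator stands in for the Fast Hough Transform, and the iteration count and erase radius are hypothetical:

    import numpy as np

    def extract_lines(boundaries, n_lines=3, erase_radius=3):
        thetas = np.deg2rad(np.arange(0, 180))
        diag = int(np.hypot(*boundaries.shape))
        work = boundaries.astype(bool)
        lines = []
        for _ in range(n_lines):
            ys, xs = np.nonzero(work)
            if len(xs) == 0:
                break
            # (1) Accumulate votes in (rho, theta) parameter space.
            rhos = np.round(np.outer(xs, np.cos(thetas)) +
                            np.outer(ys, np.sin(thetas))).astype(int) + diag
            acc = np.zeros((2 * diag + 1, len(thetas)), dtype=int)
            cols = np.tile(np.arange(len(thetas)), len(xs))
            np.add.at(acc, (rhos.ravel(), cols), 1)
            # (2) The strongest cell defines the candidate line and weight.
            rho_i, th_i = np.unravel_index(acc.argmax(), acc.shape)
            rho, theta = rho_i - diag, thetas[th_i]
            lines.append((rho, theta, int(acc[rho_i, th_i])))
            # (3) Erase boundaries near the candidate line, then iterate.
            dist = np.abs(xs * np.cos(theta) + ys * np.sin(theta) - rho)
            near = dist <= erase_radius
            work[ys[near], xs[near]] = False
        return lines  # each entry: (rho, theta, weight)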

Subprocess 425 may generate a set of candidate quadrangles using pairwise intersection of lines across different ROIs. A weight may be assigned to each candidate quadrangle as a sum of the weights of its constituent lines. The weights of the candidate quadrangles may then be corrected by, for each candidate quadrangle, reconstructing the original prototype parallelogram, and estimating the discrepancy in relation to the document model, in terms of the aspect ratio and the angle values of the document corners.
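Each candidate corner is the intersection of two lines of the form ax + by + c = 0, which can be computed via the homogeneous cross product; the example lines below are hypothetical:

    import numpy as np

    def intersect(l1, l2):
        """Intersection of two lines in <a, b, c> form (ax + by + c = 0);
        returns None if the lines are parallel."""
        p = np.cross(l1, l2)
        if abs(p[2]) < 1e-9:
            return None                 # point at infinity
        return p[:2] / p[2]

    top = np.array([0.0, 1.0, -10.0])    # y = 10 (hypothetical document edge)
    left = np.array([1.0, 0.0, -15.0])   # x = 15
    print(intersect(top, left))          # -> [15. 10.], a candidate corner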

While the above method satisfies the example requirements of the client-side embodiment in Table 1, it does not necessarily satisfy the example requirements of the server-side embodiment in Table 1. This is because the document in a scanned image can be arbitrarily positioned, which makes the definition of an ROI for the document boundaries problematic. However, the more relaxed constraints on computational performance for the server-side embodiment enable the method described in Zhukovsky et al. to be employed for subprocess 425. Notably, this method can be used to analyze both scanned input images and images obtained using a camera (e.g., of a user system 130, such as a mobile device).

In this alternative embodiment, subprocess 425 comprises a two-step detection of line segments in the input image. In the first step, 8-connected contours are collected using a binary boundaries map, and linear segments are extracted. In the second step, additional linear segments are extracted. These additional linear segments are extracted by applying a Progressive Probabilistic Hough Transform to a grayscale boundaries map, as described, for example, in “Robust Detection of Lines Using the Progressive Probabilistic Hough Transform,” Matas et al., Computer Vision and Image Understanding, vol. 78, no. 1, pp. 119-37, 2000, which is hereby incorporated herein by reference as if set forth in full.
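OpenCV's HoughLinesP implements the progressive probabilistic Hough transform of Matas et al.; a sketch of the second step, with a Canny edge map as a simple stand-in for the boundaries map and hypothetical tuning values:

    import cv2
    import numpy as np

    img = cv2.imread("scan.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical path
    edges = cv2.Canny(img, 50, 150)  # stand-in for the boundaries map

    # Returns line segments as (x1, y1, x2, y2) rows.
    segments = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=80,
                               minLineLength=60, maxLineGap=5)
    print(0 if segments is None else len(segments), "segments detected")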

Each of the extracted linear segments is classified as either mostly horizontal or mostly vertical. Based on the classifications of the extracted linear segments, the document-recognition module generates an intersections graph, in which each vertex corresponds to one of the linear segments and each edge corresponds to the intersection point of the lines defined by two linear segments. If the linear segments corresponding to adjacent vertices are orthogonally oriented, an edge is added to the intersections graph. Each intersection point may be tagged with the type of document corner (e.g., top-left, top-right, bottom-left, bottom-right, etc.) that corresponds to the intersection point. The intersections graph is transformed into a four-partite directed graph, in such a way that each edge, corresponding to an intersection point, corresponds to a single type of document corner. A cycle composed of four edges in the intersections graph represents a quadrangle in the input image. Thus, each four-edged cycle may be extracted as a candidate quadrangle. In addition, each extracted candidate quadrangle can be weighted in an analogous manner to the method described above.
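A condensed sketch of the cycle extraction, collapsing the graph traversal into an enumeration over tagged corner candidates (all values are hypothetical; a real implementation walks the four-partite graph and checks edge validity):

    from itertools import product

    # Hypothetical candidates per corner type, each as (point, weight), where
    # the weight comes from the two segments defining the intersection.
    corners = {
        "top_left":     [((15, 10), 120)],
        "top_right":    [((610, 14), 95)],
        "bottom_right": [((605, 420), 110)],
        "bottom_left":  [((12, 415), 88)],
    }

    # Taking one candidate of each corner type corresponds to one four-edge
    # cycle in the four-partite intersections graph.
    quads = []
    for tl, tr, br, bl in product(*corners.values()):
        points = [tl[0], tr[0], br[0], bl[0]]
        weight = tl[1] + tr[1] + br[1] + bl[1]
        quads.append((points, weight))

    best = max(quads, key=lambda q: q[1])
    print("best candidate quadrangle:", best)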

Both of the alternative embodiments of subprocess 425, described above, result in a set of lines and a set of quadrangles, which may be described as follows:

lines: {l = ⟨a, b, c⟩ : ax + by + c = 0}

quadrangles: {q = ⟨p₁, p₂, p₃, p₄⟩}

Subprocess 430 determines image-processing parameters prior to the extraction of keypoints and the computation of associated descriptors. These parameters may comprise scale and angle, and may depend on the scores of the lines and quadrangles found in subprocess 425. The following cases will be considered:

-   (1) At least one quadrangle was found, and the best quadrangle (i.e., greatest weight) is valid (i.e., accepted). A quadrangle may be determined to be valid when the weight of the quadrangle is greater than a predefined threshold.
-   (2) At least one quadrangle was found, but the best quadrangle is invalid (i.e., rejected). A quadrangle may be determined to be invalid when the weight of the quadrangle is less than the predefined threshold.
-   (3) No quadrangles were found, indicating that there was an insufficient number of lines found, or that it is impossible to obtain a quadrangle from the lines using a projective transformation described by a pinhole camera.

In the first case, the best quadrangle can be used to determine the scale and rotation of the document in the input image. Specifically, the scale and rotation of this best quadrangle may be used as the scale and rotation parameters for subsequent subprocesses.

In the second case, it is unreasonable to “trust” the best quadrangle. Thus, in the second case, the best quadrangle may be used to determine rotation, but not scale. For example, all constituent lines of the candidate quadrangles that comply with the geometric restrictions may be selected. The selected lines may be sorted by the angle α between each selected line and the horizon (e.g., a horizontal reference line). Then, a single line, for which the maximum number of lines is enclosed in an angular window of size Δα, is selected. In this case, angle α can be considered the rotation angle of the document, and used as the rotation parameter for subsequent subprocesses. A predefined default value may be used as the scale parameter.

In the third case, the acquired geometric characteristics do not allow a confident determination of one or both of the scale and rotation parameters. Thus, predefined default values may be used for both the scale and rotation parameters.
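The three cases may be sketched as follows; the thresholds, defaults, and the two estimate_* helpers are hypothetical simplifications, not values from the source:

    import numpy as np

    DEFAULT_SCALE, DEFAULT_ANGLE = 1.0, 0.0   # hypothetical defaults

    def estimate_rotation(points):
        # Angle of the top edge relative to horizontal (simplistic stand-in).
        (x1, y1), (x2, y2) = points[0], points[1]
        return float(np.arctan2(y2 - y1, x2 - x1))

    def estimate_scale(points, reference_width=500.0):
        # Ratio of the top-edge length to a reference width (simplistic).
        (x1, y1), (x2, y2) = points[0], points[1]
        return float(np.hypot(x2 - x1, y2 - y1) / reference_width)

    def choose_parameters(quads, line_angles, weight_threshold=100.0,
                          angular_window=np.deg2rad(2.0)):
        if not quads:
            # Case 3: no quadrangles; defaults for both parameters.
            return DEFAULT_SCALE, DEFAULT_ANGLE
        points, weight = max(quads, key=lambda q: q[1])
        if weight > weight_threshold:
            # Case 1: the best quadrangle is valid; derive scale and angle.
            return estimate_scale(points), estimate_rotation(points)
        # Case 2: quadrangle rejected; take the rotation angle of the line
        # whose angular window of size delta-alpha encloses the most lines.
        angles = np.asarray(line_angles, dtype=float)
        if angles.size == 0:
            return DEFAULT_SCALE, DEFAULT_ANGLE
        support = [(np.abs(angles - a) <= angular_window).sum()
                   for a in angles]
        return DEFAULT_SCALE, float(angles[int(np.argmax(support))])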

In subprocess 435, features are extracted based on the parameters (e.g., scale and/or rotation angle) determined in subprocess 430. Subprocess 435 may be similar or identical to subprocess 410, or may be different than subprocess 410. Regardless of the particular implementation, subprocess 435 extracts keypoints from input image Q, with the size and orientation of local neighborhoods determined by the scale and rotation parameters provided by subprocess 430. FIG. 5A illustrates the extraction of keypoints in an example input image Q, as well as a template image T. For each keypoint p, the descriptor f is calculated based on its local neighborhood.

In subprocess 440, the features, extracted in subprocess 435, are matched to the compact representations of template images T_(i) in the reference database. In an embodiment, subprocess 440 comprises performing an approximate nearest-neighbor search, in a search tree, for each descriptor f using the Hamming metric. The neighbors may be filtered, such that only those neighbors that are closer to descriptor f than a predefined threshold are considered. A voting scheme may be applied to the descriptors in the reference database that are matched to descriptors extracted from input image Q in subprocess 435. In particular, a match to a candidate descriptor f_(j)^(i) in the reference database adds a vote to the template image T_(i). Ultimately, the template images T_(i) that have received votes may be sorted in descending order of the number of votes that they received, and the F best template images T_(i) are selected from the sorted list. F may be any integer greater than or equal to one (e.g., one, two, three, five, ten, etc.).
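A sketch of this voting scheme, reusing the build_index()/nearest_neighbors() helpers from the indexing sketch above; the Hamming threshold of 60 and F=8 echo the values reported in the evaluation below:

    from collections import Counter

    def select_candidates(index, query_descriptors, hamming_threshold=60,
                          F=8):
        votes = Counter()
        for desc in query_descriptors:
            ids, dists = nearest_neighbors(index, desc, k=5)
            for template_id, dist in zip(ids, dists):
                if dist < hamming_threshold:   # filter weak neighbors
                    votes[template_id] += 1    # one vote per matched neighbor
        # The F templates with the most votes become the candidates.
        return [tid for tid, _ in votes.most_common(F)]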

2.2.3. Geometric Validation

In subprocess 445, the geometry of the features in input image Q, extracted in subprocess 435, is validated against each of the F best template images T_(i) selected in subprocess 440. In an embodiment, subprocess 445 comprises calculating a RANSAC projective transformation hypothesis H for each of the candidate template images T_(i) selected in subprocess 440. Each hypothesis H transforms points in the input image Q to points in the respective template image T within some margin of discrepancy. For a given hypothesis H, a pair of points, p and p′, with similar descriptors is considered a valid match (i.e., inlier) if:

|H(p)−p′|<δ,

p∈W(Q),

p′∈W(T),

wherein δ is the inlier threshold.

As shown in Augereau et al., if only the general RANSAC parameters are used, the number of iterations and the inlier threshold δ are not sufficient for filtering out false hypotheses. Thus, in an embodiment, one or both of the following restrictions are used to provide additional filtering:

-   The distance between any two points, in a candidate template image T, which are used to construct the hypothesis, cannot be less than a minimum distance d_(min).
-   The quadrangle from the input image Q must be convex, and none of its corners can lie outside the frame by more than a maximum distance d_(max) from the boundaries of the frame.

In general, the input image Q and a candidate template image T are connected with a projective transformation that can be computed using four pairs of matched points. In an embodiment, the classic iterative RANSAC scheme is complemented with information about the extracted lines and quadrangles as follows:

-   Each quadrangle that is found in the input image Q and which is geometrically valid yields an additional hypothesis that transforms the quadrangle's vertices to the corners of the template image T.
-   The selected lines are formed into clusters which define vanishing points. A pair of orthogonal vanishing points defines a transform H₀, such that H=H₀×H₁, in which H₁ is a supplementary transform that encodes image scale and shift. H₁ can be estimated using RANSAC. It is sufficient to use only two pairs of points to form the hypothesis H₁.

Let G_(i)(H) denote the number of inliers of hypothesis H for a template image T_(i). The hypothesis H* with the maximum number of inliers is selected as the result. If G_(i)(H*) is smaller than a predefined threshold R, the document type may be determined to be undefined. In the event that two candidates have the same number of inliers, an additional estimation may be calculated as follows:

e_(i)(H) = (1/G_(i)(H)) Σ_(j=1)^(G_(i)(H)) |H(p_(j)^(Q)) − p_(j)^(i)|/δ, δ > 0, p_(j)^(Q) ∈ W(Q), p_(j)^(i) ∈ W(T_(i)).
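This estimate (together with the inlier count G_(i)(H)) reduces to a few lines; the inputs are assumed to be numpy arrays:

    import numpy as np

    def tie_break_estimate(H, points_q, points_t, delta):
        """Mean reprojection error of the inliers of H, normalized by the
        inlier threshold delta; points are (n, 2) arrays of matched
        coordinates in Q and T_i."""
        homog = np.hstack([points_q, np.ones((len(points_q), 1))])
        projected = homog @ H.T
        projected = projected[:, :2] / projected[:, 2:3]
        errors = np.linalg.norm(projected - points_t, axis=1)
        inliers = errors < delta
        if not inliers.any():
            return float("inf"), 0
        return float(errors[inliers].mean() / delta), int(inliers.sum())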

FIG. 5B illustrates an example of quadrangle hypotheses that were generated using RANSAC for an example of a template image T and input image Q, according to an embodiment. In this example, the template image T has poor structure of static elements. The best hypothesis (i.e., the final answer) for the quadrangle is highlighted by the lighter border.

If subprocess 445 determines that none of the candidate template images T_(i) have valid geometries (i.e., “No” in subprocess 450), input image Q may be rejected in subprocess 455. Otherwise, if subprocess 445 determines that at least one of the candidate template images T_(i) has a valid geometry (i.e., “Yes” in subprocess 450), the best candidate template image T that has a valid geometry may be selected in subprocess 460. The type of document associated with the selected template image T may be output in subprocess 460 (e.g., as an identifier that identifies the type of document). In addition, the distortion parameters of input image Q may be determined, based on the selected template image T, and output in subprocess 460. For example, the distortion parameters may comprise a homography matrix that relates the document's position in input image Q to the selected template image T. In an embodiment, the document type and distortion parameters, output by subprocess 460, may be utilized in one or more further subprocesses, as represented by subprocess 330 in process 300 in FIG. 3.

3. Example Experimental Evaluation

A particular implementation of the disclosed document recognition was tested and evaluated. For testing, the open Mobile Identity Document Video 500 (MIDV-500) dataset was used. The MIDV-500 dataset is described, for example, in “MIDV-500: A Dataset for Identity Documents Analysis and Recognition on Mobile Devices in Video Stream,” Arlazarov et al., arXiv:1807.05786, 2018, which is hereby incorporated herein by reference as if set forth in full. The MIDV-500 dataset contains images of fifty different types of identity documents in two parts: (1) fifty source images; and (2) fifteen thousand video frames with a resolution of 1920×1080, obtained using smartphone cameras.

For evaluation of the client-side embodiment, 9,791 frames, in which the document is fully visible, were selected from the MIDV-500 dataset. To evaluate the server-side embodiment, an additional dataset, comprising both scanned images and photographs, was created. Specifically, this additional dataset comprised 250 scanned images with a resolution of 2400×1600, obtained using a Canon™ CanoScan LiDE 220 scanner, 250 photographs with a resolution of 4000×3000, captured using an Apple™ iPhone 7, and ground-truth text files containing the coordinates of quadrangles, representing document boundaries, for each of the images. Samples were printed using an HP™ LaserJet printer and laminated.

The processing described in Arlazarov et al. was used to prepare the document prototypes. The original source images of each type of document in the MIDV-500 dataset were used as template images T. The method in Awal et al. was used as a baseline for comparison to the disclosed document-recognition process. The method in Awal et al. was evaluated according to the description in Awal et al. and using the same template images T_(i) and zones of variable data as used for the tested embodiment of the disclosed document-recognition module. For further comparison, the disclosed document-recognition process was modified to remove the line and quadrangle detection (e.g., subprocess 425), and this modified process was also evaluated. Below, Table 2 depicts the document classification and location accuracy for each tested method, and Table 3 depicts the average processing time for the Awal et al. method and the client-side and server-side embodiments of the disclosed method:

TABLE 2
Document Classification and Location Accuracy

                               client-side    server-side
Method                         (MIDV-500)     (add'l dataset)
Classification
  Awal et al.                  72.5%          62.2%
  No Lines/Quadrangles         46.1%          31.8%
  Lines/Quadrangles            76.0%          67.6%
Location
  No Lines/Quadrangles         41.7%          26.8%
  Lines/Quadrangles            70.0%          59.0%

TABLE 3
Average Processing Time

Method         Intel Core i7-4770S, 8 GB RAM    Apple A8 (iPhone 6)
Awal et al.    1.06 seconds                     4.22 seconds
Client-Side    0.10 seconds                     0.35 seconds
Server-Side    0.78 seconds                     2.82 seconds

Since one ultimate goal of a recognition system is to extract the field values from the document, the criterion and threshold used in Skoryukina et al. were used. Specifically, location error was defined as the maximum deviation of the computed coordinates of the document corners, divided by the length of the shortest side of the document boundary. The document location is considered correct if the document type is correctly identified and the location error is less than 0.06. When evaluating computational performance, the time required for the detection of keypoints, the computation of descriptors, and the modification of the search data structures was not included. The values for performance and document classification presented in the tables were obtained with the following parameter values: the number of keypoints was restricted to 1,500 on input images and 450 on template images; the Hamming distance threshold for the neighbors search was 60 for 297-bit descriptors; the number of RANSAC iterations was 8,000 for hypotheses based on four-point pairs and on two-point pairs; the minimum distance d_(min) was 50 (i.e., 10% of the minimum side of template images); and eight candidates were passed to RANSAC for geometric validation (i.e., F=8).
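This criterion reduces to a few lines (corner order is assumed to be consistent between prediction and ground truth):

    import numpy as np

    def location_error(pred_corners, true_corners):
        """Maximum corner deviation divided by the shortest side of the
        ground-truth document boundary."""
        pred = np.asarray(pred_corners, dtype=float)
        true = np.asarray(true_corners, dtype=float)
        max_dev = np.linalg.norm(pred - true, axis=1).max()
        sides = np.linalg.norm(true - np.roll(true, -1, axis=0), axis=1)
        return max_dev / sides.min()

    def location_correct(pred_corners, true_corners, type_match):
        # Correct when the type matches and the error is below 0.06.
        return type_match and location_error(pred_corners, true_corners) < 0.06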

As shown in Table 2, the disclosed document-recognition method outperformed the Awal et al. method and the method that excluded lines and quadrangles detection, in terms of document classification accuracy, in both the client-side and server-side embodiments. The disclosed document-recognition method also outperformed the method that excluded lines and quadrangles detection, in terms of document location accuracy, in both the client-side and server-side embodiments. Notably, the Awal et al. method does not perform document location. In addition, as shown in Table 3, the disclosed document-recognition method outperformed the Awal et al. method in terms of average processing time, in both the client-side and server-side embodiments. Thus, the disclosed document-recognition algorithm outperformed prior algorithms in accuracy, while also having better computational performance.

4. Example Embodiments

As discussed herein, an embodiment of the disclosed process for document recognition is based on representing the image as a constellation of feature points and descriptors. However, in order to produce accurate distortion parameters, estimations of straight lines and quadrangles are extracted from an input image and used as additional features. In other words, the disclosed approach combines fast methods for detecting feature points with methods for locating lines and quadrangles, as opposed to prior methods which did not use these geometric primitives. While the geometric primitives of lines and quadrangles are less informative than local features, they are more computationally efficient to detect. In addition, the disclosed process is capable of performing document location and classification simultaneously. The disclosed process also enables the matched points, lines, and quadrangles to be combined and used for geometric verification (e.g., using RANSAC). Best-alternative selection criteria may be used, along with methods of estimating solution accuracy. Performance results demonstrate that the use of straight lines and quadrangles increases the accuracy of document location and results in the disclosed process outperforming prior methods in both classification precision and computational efficiency. Notably, the improvement in computational efficiency enables the disclosed process for document recognition to be performed on less powerful devices (e.g., mobile devices), on which document recognition may not otherwise be feasible.

The above description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles described herein can be applied to other embodiments without departing from the spirit or scope of the invention. Thus, it is to be understood that the description and drawings presented herein represent a presently preferred embodiment of the invention and are therefore representative of the subject matter which is broadly contemplated by the present invention. It is further understood that the scope of the present invention fully encompasses other embodiments that may become obvious to those skilled in the art and that the scope of the present invention is accordingly not limited.

Combinations, described herein, such as “at least one of A, B, or C,” “one or more of A, B, or C,” “at least one of A, B, and C,” “one or more of A, B, and C,” and “A, B, C, or any combination thereof” include any combination of A, B, and/or C, and may include multiples of A, multiples of B, or multiples of C. Specifically, combinations such as “at least one of A, B, or C,” “one or more of A, B, or C,” “at least one of A, B, and C,” “one or more of A, B, and C,” and “A, B, C, or any combination thereof” may be A only, B only, C only, A and B, A and C, B and C, or A and B and C, and any such combination may contain one or more members of its constituents A, B, and/or C. For example, a combination of A and B may comprise one A and multiple B's, multiple A's and one B, or multiple A's and multiple B's.

What is claimed is:
1. A method comprising using at least one hardware processor to: receive an input image; extract a plurality of input keypoints from the input image; calculate an input descriptor for each of the plurality of keypoints; match the plurality of input keypoints to a plurality of reference keypoints in a reference database, based on the input descriptor calculated for each of the plurality of input keypoints, to identify one or more classification candidates, wherein each of the one or more classification candidates represents a template image of a type of document; determine the type of document in the input image and one or more distortion parameters for the document based on the one or more classification candidates; and output the determined type of document and one or more distortion parameters.
2. The method of claim 1, wherein the reference database comprises a plurality of sets of reference keypoints and descriptors, wherein each of the plurality of sets represents one of a plurality of template images, and wherein each of the plurality of template images represents one of a plurality of types of document.
3. The method of claim 2, wherein at least one of the plurality of template images is represented by at least four sets of reference keypoints and descriptors in the reference database, and wherein each of the four sets represents one of the plurality of types of document rotated by a different amount of rotation than all others of the four sets.
4. The method of claim 3, wherein the different amounts of rotation for the four sets comprise 0°, 90°, 180°, and 270°.
5. The method of claim 2, further comprising, for each of the plurality of types of document: receiving the template image representing that type of document; extracting a plurality of reference keypoints from the template image; calculating a reference descriptor for each of the plurality of reference keypoints; and storing a compact representation of the template image in the reference database, wherein the compact representation comprises the plurality of reference keypoints and the reference descriptors calculated for the plurality of reference keypoints.
6. The method of claim 5, wherein each reference descriptor is stored in a hierarchical clustering tree.
7. The method of claim 5, wherein extracting a plurality of reference keypoints from the template image comprises excluding any keypoints that are within a region of the template image that has been identified as representing a field of variable data.
8. The method of claim 5, wherein extracting a plurality of reference keypoints from the template image comprises selecting the plurality of reference keypoints by: calculating a score for a plurality of candidate keypoints using a Yet Another Contrast-Invariant Point Extractor (YACIPE) algorithm; and selecting a subset of the plurality of candidate keypoints with highest scores as the plurality of reference keypoints.
9. The method of claim 5, wherein calculating a reference descriptor for each of the plurality of reference keypoints comprises calculating a receptive field descriptor for an image region around the reference keypoint.
10. The method of claim 9, wherein each reference descriptor comprises a vector of binary features.
11. The method of claim 1, wherein the one or more distortion parameters comprise a homography matrix.
12. The method of claim 1, further comprising extracting data from the input image based on the determined type of document and the one or more distortion parameters.
13. The method of claim 12, wherein the extracted data comprises one or more of text, an image, or a table.
14. The method of claim 1, wherein the one or more classification candidates comprise a plurality of classification candidates, and wherein the method further comprises using the at least one hardware processor to: calculate a rank for each of the plurality of classification candidates; and select one of the plurality of classification candidates based on the calculated ranks, wherein determining the type of document in the input image comprises identifying a type of document associated with the selected classification candidate.
15. The method of claim 14, wherein selecting one of the plurality of classification candidates comprises: selecting a subset of the plurality of classification candidates that have highest calculated ranks; performing a geometric validation with at least one of the classification candidates in the selected subset to identify the classification candidate as valid or invalid; and selecting one of the classification candidates, having a maximum calculated rank, from the classification candidates in the selected subset that are identified as valid.
16. The method of claim 15, wherein the geometric validation with each of the one or more classification candidates comprises: calculating a transformation matrix that maps input keypoints in the input image to reference keypoints in the template image represented by the classification candidate; when the mapping is within a predefined accuracy, determining that the transformation matrix is valid; and, when the mapping is not within the predefined accuracy, determining that the transformation matrix is invalid.
17. The method of claim 16, wherein the transformation matrix is a Random Sample Consensus (RANSAC) transformation matrix.
18. The method of claim 17, wherein, for each of the one or more classification candidates, the transformation matrix transforms vertices of the at least one quadrangle to corners of the template image represented by the classification candidate.
19. The method of claim 17, wherein, for each of the one or more classification candidates, the transformation matrix is constrained by one or both of the following: a distance between any two reference keypoints is greater than a minimum distance threshold; or the at least one quadrangle is convex and no vertices of the at least one quadrangle lie outside the input image by more than a maximum distance threshold.
20. A system comprising: at least one hardware processor; and one or more software modules that, when executed by the at least one hardware processor, receive an input image, extract a plurality of input keypoints from the input image, calculate an input descriptor for each of the plurality of keypoints, match the plurality of input keypoints to a plurality of reference keypoints in a reference database, based on the input descriptor calculated for each of the plurality of input keypoints, to identify one or more classification candidates, wherein each of the one or more classification candidates represents a template image of a type of document, determine the type of document in the input image and one or more distortion parameters for the document based on the one or more classification candidates, and output the determined type of document and one or more distortion parameters.
21. A non-transitory computer-readable medium having instructions stored therein, wherein the instructions, when executed by a processor, cause the processor to: receive an input image; extract a plurality of input keypoints from the input image; calculate an input descriptor for each of the plurality of keypoints; match the plurality of input keypoints to a plurality of reference keypoints in a reference database, based on the input descriptor calculated for each of the plurality of input keypoints, to identify one or more classification candidates, wherein each of the one or more classification candidates represents a template image of a type of document; determine the type of document in the input image and one or more distortion parameters for the document based on the one or more classification candidates; and output the determined type of document and one or more distortion parameters.